Abstract

The convergence behavior of gradient methods for minimizing convex differentiable functions is one 
of the core questions in convex optimization. This paper shows that their well-known complexities can 
be achieved under conditions weaker than the commonly accepted ones. We relax the common gradient 
Lipschitz-continuity condition and strong convexity condition to ones that hold only over certain line 
segments. Specifically, we establish complexities $O(R/\epsilon)$ and $O(\sqrt{R/\epsilon})$ for the ordinary and accelerated
gradient methods, respectively, assuming that $\nabla f$ is Lipschitz continuous with constant $R$ over the
line segment joining $x$ and $x - \frac{1}{R}\nabla f(x)$ for each $x \in \operatorname{dom} f$. Then we improve them to $O(\frac{R}{\nu}\log(\frac{1}{\epsilon}))$ and
$O(\sqrt{R/\nu}\,\log(\frac{1}{\epsilon}))$ for functions $f$ that also satisfy the secant inequality $\langle \nabla f(x), x - x^*\rangle \ge \nu\|x - x^*\|^2$ for
each $x \in \operatorname{dom} f$ and its projection $x^*$ to the minimizer set of $f$. The secant condition is also shown
to be necessary for the geometric decay of solution error. Not only are the relaxed conditions met by
more functions, the restrictions give smaller $R$ and larger $\nu$ than they are without the restrictions and
thus lead to better complexity bounds. We apply these results to sparse optimization and demonstrate 
a faster algorithm. 

Keywords: sublinear convergence, linear convergence, restricted Lipschitz continuity, restricted strong 
convexity, Nesterov acceleration, restart technique, skipping technique, sparse optimization. 

1 Introduction 

Owing much to the fast development in signal/image processing, compressive sensing, statistical and machine 
learning, and parallel computing, we have witnessed the (revived) popularity of gradient methods, which are 
easy to program, have relatively low per-iteration complexities, and are often among the best options for 
obtaining moderately accurate solutions for large-scale optimization problems. 
This paper considers the convex unconstrained optimization problem: 

$$f^* := \min_{x} f(x), \tag{1}$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a differentiable convex function. We assume throughout the paper that the set of
optimal solutions $X^*$ is nonempty and closed and thus $f^* \in \mathbb{R}$ is attainable. For simplicity, we assume
$\operatorname{dom} f = \mathbb{R}^n$. Most of the discussions in this paper hold if we impose $x \in \operatorname{dom} f$ rather than $x \in \mathbb{R}^n$.
The gradient descent iteration is

$$x^{(k+1)} = x^{(k)} - h\nabla f(x^{(k)}). \tag{2}$$

Its convergence rates have been established for two major classes of functions [6, 7, 8]. The first class, denoted
by $\mathcal{F}_L(\mathbb{R}^n)$, consists of the convex functions with Lipschitz continuous gradients, namely,

$f \in \mathcal{F}_L(\mathbb{R}^n) \iff f$ is differentiable and

$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^n, \tag{3}$$

where $L > 0$ is the Lipschitz constant of $\nabla f$; the second class, denoted by $\mathcal{S}_{\mu,L}(\mathbb{R}^n)$, is a subclass of $\mathcal{F}_L(\mathbb{R}^n)$
in which the functions are also strongly convex, namely,

$f \in \mathcal{S}_{\mu,L}(\mathbb{R}^n) \iff f \in \mathcal{F}_L(\mathbb{R}^n)$ and

$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \mu\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^n, \tag{4}$$

where $\mu > 0$ is the convex modulus of $f$. Geometrically, if $f \in \mathcal{F}_L$, $\nabla f$ cannot change too quickly; the
curvature of $f$ (assuming $f \in C^2$) is upper bounded by $L$. If $f \in \mathcal{S}_{\mu,L}$, $\nabla f$ cannot change too slowly either;
the curvature of $f$ (assuming $f \in C^2$) is lower bounded by $\mu$. One might be more familiar with certain equivalent
conditions of (3) and (4).
Table 1: Complexities of minimizing a convex differentiable function to $\epsilon$-accuracy

For any $f \in \mathcal{F}_L$, iteration (2) reduces $f_k = f(x^{(k)})$ at the rate of $O(L/k)$; hence, it takes $O(L/\epsilon)$ iterations
to guarantee $f_k \le f^* + \epsilon$. For any $f \in \mathcal{S}_{\mu,L}$, the rate is improved to $O\big((\frac{L-\mu}{L+\mu})^{2k}\big)$. Therefore, it only takes
$O(\frac{L}{\mu}\log(\frac{1}{\epsilon}))$ iterations.

In the seminal paper [5], Nesterov presents an accelerated gradient descent iteration. For functions in $\mathcal{F}_L$,
its complexity is $O(\sqrt{L/\epsilon})$. In papers [7, 5], he generalizes the method to more function classes. In particular,
if $f \in \mathcal{S}_{\mu,L}$, the complexity is $O(\sqrt{L/\mu}\,\log(\frac{1}{\epsilon}))$. He gives examples of functions on which no gradient-based
methods can perform fundamentally better. So, his method has the optimal worst-case complexities; for
more detail, see book [5]. The complexities discussed above are summarized in Table 1.



1.1 Contributions 

We show that global Lipschitz continuity of $\nabla f$ is not necessary for deriving the sublinear bounds in Table
1. If $\nabla f$ is Lipschitz continuous with constant $R > 0$ restricted to the line segments joining $x$ and $x -
(1/R)\nabla f(x)$, for $x = x^{(0)}, x^{(1)}, \ldots$, or simply $x \in \mathbb{R}^n$, then the ordinary and accelerated gradient descent
methods have complexities $O(R/\epsilon)$ and $O(\sqrt{R/\epsilon})$, respectively. We believe that some researchers, especially
those who study line search methods, might be aware of this result though we do not find it in the literature.
Our analysis in fact hints at a backtracking line search method that achieves the same complexities without
the knowledge of $R$. It is worth noting that the recent paper QT] presents a skillful line search method that
improves Nesterov's accelerated gradient method.

On the other hand, the Lipschitz continuity of $\nabla f$ alone gives at best the rather weak $O(1/\epsilon)$ and $O(1/\sqrt{\epsilon})$
complexities. It is commonly known that the strong convexity of $f$ enables the much better complexity of
$O(\log(1/\epsilon))$. However, most convex functions are not strongly convex. Hence, it is interesting to relax
the conditions and still establish a linear convergence rate. We show that an inequality resembling (4) but
concerning just the secant between $x$ and its projection to $X^*$ is ultimately responsible for linear convergence.
The inequality imposes a positive lower bound on the average curvature between x and the solution set and 
is shown to be both sufficient and necessary for the geometric decay of solution error. 

1.2 Outline of the paper 

The rest of the paper is organized as follows. Section 2 defines new properties of functions along with
examples and discussions. Section 3 describes the convergence and complexity results. Section 4 applies
these results to the augmented $\ell_1$ model and presents numerical results of sparse signal recovery. Finally,
Section 5 concludes this paper.

2 Weakened conditions 

For any two vectors $u, v \in \mathbb{R}^n$, we let the set of points on the line segment between $u$ and $v$ be denoted by
$[u, v]$, i.e.,

$$[u, v] = \{w \in \mathbb{R}^n : w = \lambda u + (1 - \lambda)v,\ 0 \le \lambda \le 1\}.$$

Definition 1 (Restricted Lipschitz-continuous gradient - RLG($R$)). A function $f(x) : \mathbb{R}^n \to \mathbb{R}$ has a
restricted Lipschitz-continuous gradient (RLG) with constant $R > 0$ if it is differentiable and obeys

$$\|\nabla f(x) - \nabla f(y)\| \le R\|x - y\|, \quad \forall (x, y) \in \Omega, \tag{5}$$

where

$$\Omega = \bigcup_{z \in \mathbb{R}^n} \{(x, y) : x, y \in [z, z - (1/R)\nabla f(z)]\}. \tag{6}$$

This definition requires $\nabla f$ not to change too quickly over the specified downhill line segments (6).
Constant $R$ can generally be smaller than the global Lipschitz constant $L$.

Definition 2 (Restricted secant inequality - RSI($\nu$)). A function $f(x) : \mathbb{R}^n \to \mathbb{R}$ satisfies the restricted
secant inequality (RSI) with constant $\nu > 0$ if it is differentiable and obeys

$$\langle \nabla f(x) - \nabla f(x_{\mathrm{prj}}), x - x_{\mathrm{prj}}\rangle \ge \nu\|x - x_{\mathrm{prj}}\|^2, \tag{7}$$

where $x_{\mathrm{prj}} = \mathrm{Proj}_{X^*}(x)$ is the projection of $x$ onto the solution set $X^*$. Such $f$ is called an RSI function.

Note that $\nabla f(x_{\mathrm{prj}}) = 0$ by definition. Constant $\nu$ can be viewed as a lower bound of the average curvature
of $f$ between $x$ and $x_{\mathrm{prj}}$. Since the goal of minimization is to reach the solution set $X^*$, in order to have linear
convergence, it turns out only the "average minimum curvature" between the current $x$ and its projection
$x_{\mathrm{prj}}$ matters. Using RSI, we introduce restricted strongly convex (RSC) functions.

Definition 3 (Restricted strong convexity - RSC($\nu$)). A function $f(x) : \mathbb{R}^n \to \mathbb{R}$ is restricted strongly
convex with constant $\nu > 0$ if it is convex, has a finite minimizer, and satisfies RSI($\nu$).

RSC is weaker than strong convexity as (7) is a relaxation of inequality (4). Some of our convergence
results will be given for the following new classes of functions.

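Definition 4 (New function classes). Let $R, \nu > 0$. Define function classes

$\mathcal{L}_R(\mathbb{R}^n) = \{f : \mathbb{R}^n \to \mathbb{R} \mid f \text{ is convex and RLG}(R)\}$,
$\mathcal{R}_{R,\nu}(\mathbb{R}^n) = \{f \in \mathcal{L}_R(\mathbb{R}^n) \mid f \text{ is RSC}(\nu)\}$,
$\mathcal{R}_{L,\nu}(\mathbb{R}^n) = \{f \in \mathcal{F}_L(\mathbb{R}^n) \mid f \text{ is RSC}(\nu)\}$.

By definition, if $\mu \ge \nu$ and $L = R$, then we have

$$\mathcal{S}_{\mu,L}(\mathbb{R}^n) \subset \mathcal{R}_{L,\nu}(\mathbb{R}^n) \subset \mathcal{R}_{R,\nu}(\mathbb{R}^n) \subset \mathcal{L}_R(\mathbb{R}^n).$$

Definition 3 is different from another recent definition of restricted strong convexity from [5].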

Definition 5 (Restricted strong convexity of [5]). A function $f(x) : \mathbb{R}^n \to \mathbb{R}$ satisfies the restricted strong
convexity at $x_0$ with constants $\kappa_1, \kappa_2 > 0$ and tolerance function $r(x)$ if it is differentiable and

$$f(x_0 + \delta) - f(x_0) - \langle \nabla f(x_0), \delta\rangle \ge \kappa_1\|\delta\|^2 - \kappa_2 (r(x_0))^2, \tag{8}$$

for all $\delta \in C$, where $C$ is a certain point set.

Definition 5 is a local and weakened version of strong convexity. With $r(x) = 0$ and $C = \mathbb{R}^n$, it reduces
to the standard strong convexity.

Many of the recent algorithms for sparse optimization are observed to converge quickly, at least on problems
that are not severely "ill-conditioned"; however, their underlying objective functions are not strongly
convex - a property commonly used to ensure global linear convergence. When $A$ has more columns than
rows, a function in the form of $g(Ax - b)$, even with a strongly convex function $g$, is "flat" along many
directions. Gradients along these directions are small, so minimization can progress very slowly. However, in
problems with certain types of $A$ and an additional regularization function $r(x)$ such as the $\ell_1$-norm, moving
along these directions will significantly change $r(x)$. We believe this has motivated the definition of restricted strong
convexity in [5], which extends the ordinary definition by including the relaxation term involving $r(x)$. That
paper argues that, with high probability for problems with $A$ that is random or satisfies certain restricted
eigenvalue properties, Definition 5 is satisfied by $f(x) = g(Ax - b) + r(x)$, and as a result, the prox-linear or
gradient-projection iteration has a (nearly-) linear convergence behavior, specifically,

$$\|x^{(k+1)} - x^*\|^2 \le c^k\|x^{(0)} - x^*\|^2 + O(\|x^* - x^0\|^2),$$

where $c < 1$, $x^*$ and $x^0$ are the minimizer and underlying true signal, respectively, and $x^{(k)}$ stands for the
$k$th iterate. Our paper focuses on the minimization of convex differentiable functions in the general setting
and establishes unmodified sublinear and linear convergence without a probabilistic argument.

2.1 Properties 

This subsection gives the core lemmas for establishing the main convergence results. 

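Lemma 1. Let $X^*$ be the nonempty solution set of (1). If $f \in \mathcal{L}_R(\mathbb{R}^n)$ with $R > 0$, then we have
1) For any $(x, y) \in \Omega$ given in (6), it holds

$$f(y) - f(x) - \langle \nabla f(x), y - x\rangle \le \frac{R}{2}\|x - y\|^2; \tag{9}$$

2) For any $y \in X^*$, it holds

$$\frac{1}{2R}\|\nabla f(x)\|^2 \le \langle \nabla f(x), x - y\rangle. \tag{10}$$

Proof. For any $(x, y) \in \Omega$, (9) follows from

$$\begin{aligned}
f(y) &= f(x) + \int_0^1 \langle \nabla f(x + \tau(y - x)), y - x\rangle\, d\tau \\
&= f(x) + \langle \nabla f(x), y - x\rangle + \int_0^1 \langle \nabla f(x + \tau(y - x)) - \nabla f(x), y - x\rangle\, d\tau \\
&\le f(x) + \langle \nabla f(x), y - x\rangle + \int_0^1 \|\nabla f(x + \tau(y - x)) - \nabla f(x)\|\,\|y - x\|\, d\tau \\
&\le f(x) + \langle \nabla f(x), y - x\rangle + \frac{R}{2}\|x - y\|^2,
\end{aligned}$$

where the first inequality follows from the Cauchy-Schwarz inequality and the second one follows from the
definition of RLG. For part 2), for any $y \in X^*$ we have

$$\begin{aligned}
f^* = f(y) &\le f(x - R^{-1}\nabla f(x)) \\
&\le f(x) + \langle \nabla f(x), (x - R^{-1}\nabla f(x)) - x\rangle + \frac{R}{2}\|(x - R^{-1}\nabla f(x)) - x\|^2 \\
&= f(x) - (2R)^{-1}\|\nabla f(x)\|^2,
\end{aligned}$$

where the second inequality follows from part 1). Therefore, we have

$$\frac{1}{2R}\|\nabla f(x)\|^2 \le f(x) - f(y) \le \langle \nabla f(x), x - y\rangle,$$

where the second inequality utilizes the convexity of $f$. □

Note that for general $y$, the inequality (10) does not hold. For example, setting $y = x + \eta\nabla f(x)$ with
$\eta > 0$ gives $\langle \nabla f(x), x - y\rangle = -\eta\|\nabla f(x)\|^2 < \frac{1}{2R}\|\nabla f(x)\|^2$.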

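Lemma 2. Let $X^*$ be the nonempty solution set of (1). If $f \in \mathcal{R}_{R,\nu}(\mathbb{R}^n)$ with $R > 0$ and $\nu > 0$, then for
every $\theta \in [0, 1]$ the following holds:

$$\langle \nabla f(x) - \nabla f(x_{\mathrm{prj}}), x - x_{\mathrm{prj}}\rangle \ge \frac{\theta}{2R}\|\nabla f(x) - \nabla f(x_{\mathrm{prj}})\|^2 + (1 - \theta)\nu\|x - x_{\mathrm{prj}}\|^2, \tag{11}$$

where $x_{\mathrm{prj}}$ is the projection of $x$ onto the solution set $X^*$.

Proof. Obviously, $x_{\mathrm{prj}} \in X^*$ and $\nabla f(x_{\mathrm{prj}}) = 0$. Thus, from part 2) of Lemma 1 we have

$$\langle \nabla f(x) - \nabla f(x_{\mathrm{prj}}), x - x_{\mathrm{prj}}\rangle \ge \frac{1}{2R}\|\nabla f(x) - \nabla f(x_{\mathrm{prj}})\|^2. \tag{12}$$

On the other hand, from the definition of RSC($\nu$), we obtain

$$\langle \nabla f(x) - \nabla f(x_{\mathrm{prj}}), x - x_{\mathrm{prj}}\rangle \ge \nu\|x - x_{\mathrm{prj}}\|^2. \tag{13}$$

Inequality (11) follows by multiplying (12) by $\theta$ and (13) by $1 - \theta$ and adding the results. □

Parameter $\theta$ in (11) will be optimized to obtain a convergence bound.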
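Lemma 3. Let $f(x)$ satisfy RSI($\nu$), $\nu > 0$, and $X^*$ be the nonempty solution set. For $\forall x \in \mathbb{R}^n$ we have

$$f(x) - f(x_{\mathrm{prj}}) \ge \frac{\nu}{2}\|x - x_{\mathrm{prj}}\|^2, \tag{14}$$

where $x_{\mathrm{prj}}$ is the projection of $x$ onto the solution set $X^*$.

Proof. Since for any $\tau \in [0, 1]$ point $y_\tau = x_{\mathrm{prj}} + \tau(x - x_{\mathrm{prj}}) \in [x, x_{\mathrm{prj}}]$ projects to $X^*$ at $x_{\mathrm{prj}}$, we have

\begin{align}
f(x) &= f(x_{\mathrm{prj}}) + \int_0^1 \langle \nabla f(x_{\mathrm{prj}} + \tau(x - x_{\mathrm{prj}})), x - x_{\mathrm{prj}}\rangle\, d\tau \tag{15a} \\
&= f(x_{\mathrm{prj}}) + \int_0^1 \frac{1}{\tau}\langle \nabla f(x_{\mathrm{prj}} + \tau(x - x_{\mathrm{prj}})) - \nabla f(x_{\mathrm{prj}}), \tau(x - x_{\mathrm{prj}})\rangle\, d\tau \tag{15b} \\
&\ge f(x_{\mathrm{prj}}) + \int_0^1 \frac{1}{\tau}\nu\tau^2\|x - x_{\mathrm{prj}}\|^2\, d\tau \tag{15c} \\
&= f(x_{\mathrm{prj}}) + \frac{\nu}{2}\|x - x_{\mathrm{prj}}\|^2, \tag{15d}
\end{align}

where (15b) follows from $\nabla f(x_{\mathrm{prj}}) = 0$ and (15c) from RSI($\nu$). □

It is worth noting that since $x_{\mathrm{prj}}$ is restricted, inequality (14) does not mean that $f$ grows everywhere
quicker than the quadratic function $q(x) = \frac{\nu}{2}\|x - x_{\mathrm{prj}}\|^2$.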



2.2 Examples of RSI and RSC functions 




Figure 1: Non-convex functions satisfying RSI (panels (a) and (b))

Examples 1 and 2 below are non-convex and probably of no practical use. However, they illustrate that
RSI inequality (7) imposes a "minimum average curvature" of $f$ between $x$ and $x_{\mathrm{prj}}$, and unlike (4), it alone
does not guarantee convexity. Hence, the RSC definition must explicitly include convexity.



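Example 1 (Figure 1(a), RSI and non-convex).

$$f_1(x) = \begin{cases} 0, & x \le 0, \\ 1 - \sqrt{1 - x^2}, & 0 < x \le 1, \\ 1 + \sqrt{1 - (x - 2)^2}, & 1 < x \le 2 - \frac{\sqrt{2}}{2}, \\ \ldots, & x > 2 - \frac{\sqrt{2}}{2}. \end{cases} \tag{16}$$

$f_1$ is non-convex, and its minimizer set is $(-\infty, 0]$. Since $f_1'(x) \to +\infty$ as $x \to 1$, $f_1'$ is not Lipschitz
continuous. $f_1$ satisfies RSI($\nu$) with $\nu = \min_{x > 0} f_1'(x)/x$.
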
Example 2 (Figure 1(b), RSI and non-convex).

$$f_2(x) = \begin{cases} 0, & x \le 0, \\ 1 - \sqrt{1 - x^2}, & 0 < x \le \frac{\sqrt{2}}{2}, \\ \sqrt{1 - (x - \sqrt{2})^2} - \sqrt{2} + 1, & \frac{\sqrt{2}}{2} < x \le 1, \\ \ldots, & x > 1. \end{cases} \tag{17}$$

$f_2$ is non-convex, and its minimizer set is $(-\infty, 0]$. Unlike $f_1$, $f_2$ has a
Lipschitz continuous gradient. $f_2$ satisfies RSI($\nu$) with $\nu = \sqrt{\frac{\sqrt{2} - 1}{2}} = \min_{x > 0} f_2'(x)/x$.

Examples 3 and 4 below explain that RSC and strict convexity do not contain each other, and strong
convexity is strictly included in their intersection. Recall that a function $f$ is strictly convex if $f(\alpha x + (1 -
\alpha)y) < \alpha f(x) + (1 - \alpha)f(y)$ for any $x \ne y$ and $\alpha \in (0, 1)$.

Figure 2: RSC but not strictly convex

Example 3 (Figure 2, RSC but not strictly convex). Let $x \in \mathbb{R}$, $\beta > 0$ and define

$$\mathrm{shrink}_\beta(x) = \mathrm{sign}(x)\max\{|x| - \beta, 0\}, \tag{18}$$
$$f_3(x) = \frac{1}{2}\|\mathrm{shrink}_\beta(x)\|^2.$$

$f_3$ is not strictly convex since $f_3(x) = 0$ for $x \in X^* = [-\beta, \beta]$, which is its minimizer set. On the other hand,
$f_3(x) = (1/2)\|x - \beta\|^2$ for $x > \beta$ and $f_3(x) = (1/2)\|x + \beta\|^2$ for $x < -\beta$, so $f_3$ is RSC($\nu$) with $\nu = 1$.
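
The claim in Example 3 can be checked numerically. The following is a minimal sketch, assuming a hypothetical threshold `beta = 0.5` and randomly sampled test points; the helper names (`shrink`, `project`, `rsi_ratio`) are illustrative and not from the paper. It evaluates the ratio between the two sides of the restricted secant inequality (7) outside the minimizer set, which should equal 1 for this function, and confirms that the function is flat on the minimizer set.

```python
import random


def shrink(x: float, beta: float) -> float:
    """Soft-thresholding operator sign(x) * max(|x| - beta, 0) from (18)."""
    if x > beta:
        return x - beta
    if x < -beta:
        return x + beta
    return 0.0


def f3(x: float, beta: float) -> float:
    """Objective of Example 3: 0.5 * shrink(x)^2."""
    return 0.5 * shrink(x, beta) ** 2


def grad_f3(x: float, beta: float) -> float:
    # The derivative of 0.5 * shrink(x)^2 is shrink(x) itself.
    return shrink(x, beta)


def project(x: float, beta: float) -> float:
    """Projection onto the minimizer set X* = [-beta, beta]."""
    return min(max(x, -beta), beta)


def rsi_ratio(x: float, beta: float) -> float:
    """<f3'(x), x - x_prj> / |x - x_prj|^2, defined for x outside X*."""
    gap = x - project(x, beta)
    return grad_f3(x, beta) * gap / (gap * gap)


beta = 0.5  # hypothetical threshold chosen for illustration
random.seed(0)
samples = [random.uniform(-5.0, 5.0) for _ in range(1000)]
# Keep only points clearly outside the minimizer set.
ratios = [rsi_ratio(x, beta) for x in samples if abs(x) > beta + 1e-6]
min_ratio = min(ratios)
# f3 vanishes on the whole interval [-beta, beta], so it is not strictly convex.
flat_values = [f3(x, beta) for x in (-0.5, -0.1, 0.0, 0.3, 0.5)]
```

The smallest sampled ratio plays the role of $\nu$; a sampled check of this kind only supports the constant and does not prove it.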

Example 4 (Strictly convex, but not RSC). Functions $f(x) = x^4$ and $f(x) = e^x$ are strictly convex but not
RSC. In particular, $f(x) = e^x$ does not have a minimizer though it is lower bounded by 0.

Motivated by the above examples, we can divide convex differentiable functions into subclasses of RSC,
strictly convex, and strongly convex functions depicted in Figure 3. Strictly and strongly convex functions
do not need to be differentiable. Although our definition of RSC can be generalized for non-differentiable
functions through their subdifferentials, we keep it simple as is.

Figure 3: Classes of convex differentiable functions

Example 5 (Dual objective of augmented $\ell_1$ model). Let $A \in \mathbb{R}^{m \times n}$. The Lagrange dual problem to

$$\min_x \Big\{ \|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2 : Ax = b \Big\} \tag{19}$$

is

$$\max_y f(y) = b^T y - \frac{\alpha}{2}\|\mathrm{shrink}_1(A^T y)\|^2, \tag{20}$$

where $\mathrm{shrink}_1(z)$ is given in (18). Provided that $Ax = b$ is consistent, [3] shows that $-f$ is RSC($\nu$) with
$\nu > 0$. (See Lemma 7 of [4] for an explicit lower bound of $\nu$.)

Admittedly, establishing RSC and deriving a bound for v are not straightforward as they typically involve 
projection to the minimizer set X* , which may not be easy to analytically derive. On the other hand, we 
have to live with RSC as we will show later that it is both sufficient and necessary. Next we present a useful 
result for certain composite functions. 

Theorem 1 (Linear composition). Let $g \in \mathcal{R}_{L,\nu}(\mathbb{R}^m)$. If $g$ has a unique minimizer $y^*$ and matrix $A \in \mathbb{R}^{m \times n}$
($m \le n$) has full row-rank, then function $f(x) = g(Ax)$ is RSC. Specifically,

$$f(x) \in \mathcal{R}_{\bar{L},\bar{\nu}}(\mathbb{R}^n), \tag{21}$$

where $\bar{L} = L\|A\|^2$ and $\bar{\nu} = \nu\lambda_{\min}(AA^T)$.

Applying this theorem, any strongly convex function $g$ with Lipschitz continuous gradient satisfies the
condition of Theorem 1 and thus $f(x) = g(Ax)$ is RSC if $A$ has full row-rank though $f$ is generally not
strongly convex. ($f$ will be strongly convex if $A$ has full column-rank, following a standard argument.)
$f(x) = g(Ax)$ arises in various applications including examples in convex quadratic minimization, statistical
regression, routing problems in data networks, and many others.

Proof of Theorem 1. For any $x, y \in \mathbb{R}^n$, we have

$$\|\nabla f(x) - \nabla f(y)\| = \|A^T\nabla g(Ax) - A^T\nabla g(Ay)\| \le L\|A\|\|A(x - y)\| \le (L\|A\|^2)\|x - y\|,$$

which means $f \in \mathcal{F}_{\bar{L}}$. By definition, the minimizer set of $f$ is

$$X^* = \{x \in \mathbb{R}^n : Ax = y^*\},$$

which is nonempty since $A$ has full row-rank. The projection of any $x \in \mathbb{R}^n$ to $X^*$ is

$$x_{\mathrm{prj}} = x + A^T(AA^T)^{-1}(y^* - Ax).$$

Since $\nabla f(x) = A^T\nabla g(Ax)$, we have

$$\langle \nabla f(x), x - x_{\mathrm{prj}}\rangle = \langle \nabla g(Ax) - \nabla g(Ax_{\mathrm{prj}}), Ax - Ax_{\mathrm{prj}}\rangle \ge \nu\|A(x - x_{\mathrm{prj}})\|^2 \ge (\nu\lambda_{\min}(AA^T))\|x - x_{\mathrm{prj}}\|^2,$$

where the first inequality follows from $g \in \mathcal{R}_{L,\nu}$ and the second one from $x - x_{\mathrm{prj}} \in \mathrm{Range}(A^T)$. □
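
To make the projection formula and the resulting RSI constant concrete, here is a small sketch under stated assumptions: $A$ is a single hypothetical row `[1, 2]` (so $m = 1$, $n = 2$ and $\lambda_{\min}(AA^T) = \|a\|^2 = 5$), and $g(y) = \frac{1}{2}(y - y^*)^2$ with a hypothetical $y^* = 3$, which is strongly convex with modulus 1. None of these values come from the paper. The sketch compares both sides of the secant inequality and shows that $f$ is constant along the null space of $A$, hence not strongly convex.

```python
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


A_ROW: Vector = [1.0, 2.0]  # the single row of A (hypothetical data)
Y_STAR = 3.0                # unique minimizer of g(y) = 0.5 * (y - Y_STAR)^2
NU_G = 1.0                  # strong convexity modulus of g


def f(x: Vector) -> float:
    """Composite objective f(x) = g(Ax)."""
    return 0.5 * (dot(A_ROW, x) - Y_STAR) ** 2


def grad_f(x: Vector) -> Vector:
    """Gradient A^T g'(Ax)."""
    residual = dot(A_ROW, x) - Y_STAR
    return [a * residual for a in A_ROW]


def project(x: Vector) -> Vector:
    """x + A^T (A A^T)^{-1} (y* - A x); A A^T is the scalar ||a||^2 here."""
    t = (Y_STAR - dot(A_ROW, x)) / dot(A_ROW, A_ROW)
    return [xi + a * t for xi, a in zip(x, A_ROW)]


def rsi_sides(x: Vector) -> Tuple[float, float]:
    """Left side <grad f(x), x - x_prj> and right side nu_bar * ||x - x_prj||^2."""
    x_prj = project(x)
    d = [xi - pi for xi, pi in zip(x, x_prj)]
    lhs = dot(grad_f(x), d)
    rhs = NU_G * dot(A_ROW, A_ROW) * dot(d, d)
    return lhs, rhs


x = [4.0, -1.0]
x_prj = project(x)
lhs, rhs = rsi_sides(x)
# Moving along the null-space direction (2, -1) of A leaves f unchanged.
shifted = [x[0] + 2.0, x[1] - 1.0]
```

In this one-row case the two sides coincide, so the constant $\nu\lambda_{\min}(AA^T)$ is attained; for general $A$ and $g$ the inequality can be strict.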

2.3 Convex conjugacy 

The conjugate of convex function $f$ is

$$f^*(y) := \sup_x \{\langle y, x\rangle - f(x)\}. \tag{22}$$

A duality relation can be obtained between RLG and RSC, in analogy to the well-known result that a convex
function $f$ is differentiable and $\nabla f$ is Lipschitz-continuous with constant $L$ if and only if $f^*$ is strongly convex
with constant $1/L$. In this subsection, we consider non-differentiable functions to present our result (while
we restrict ourselves to differentiable functions in other sections).

Definition 6. Let $f$ be a convex function. We say that $f$ has restricted Lipschitz subgradients if there exists
$L > 0$ such that for any $x \ne 0$,

$$L\langle p - q, x\rangle \ge \|p - q\|^2, \quad \forall p \in \partial f(x),\ q = \mathrm{Proj}_{\partial f(0)}(p).$$

Definition 6 applies to non-differentiable functions while the usual Lipschitz continuity of gradient of
course requires differentiability. In Example 5, the primal objective (19) is non-differentiable but satisfies
Definition 6 with $L = \alpha^{-1}$.

Theorem 2. Let $f$ be a strictly convex function and $0 \in \operatorname{dom} f$. $f$ has restricted Lipschitz subgradients with
constant $L > 0$ if and only if $f^*$ is RSC with constant $L^{-1} > 0$.

Proof. Due to the strict convexity of $f$, the sup-problem in (22) has a unique solution, denoted by $x(y)$,
which satisfies

$$0 \in y - \partial f(x(y)).$$

Also, $f^*$ is differentiable since $f$ is strictly convex, and $\nabla f^*(y) = x(y)$.

Consider problem $\min_y f^*(y)$, which has solution set $Y^* = \{y : \nabla f^*(y) = 0\} = \{y : x(y) = 0\} = \partial f(0)$.

"$\Rightarrow$" Pick $y \notin Y^*$ and let $y_{\mathrm{prj}} = \mathrm{Proj}_{Y^*}(y) = \mathrm{Proj}_{\partial f(0)}(y) \in Y^*$. From $y \in \partial f(x(y))$,

$$\langle \nabla f^*(y) - \nabla f^*(y_{\mathrm{prj}}), y - y_{\mathrm{prj}}\rangle = \langle x(y), y - y_{\mathrm{prj}}\rangle \ge L^{-1}\|y - y_{\mathrm{prj}}\|^2,$$

where the last inequality follows from Definition 6.

"$\Leftarrow$" Pick any $x \ne 0$ and $p \in \partial f(x)$. Let $y = p$ and $y_{\mathrm{prj}} = q = \mathrm{Proj}_{\partial f(0)}(p)$. Then, $\nabla f^*(y) = x$ and
$\nabla f^*(y_{\mathrm{prj}}) = 0$. Then,

$$L\langle p - q, x\rangle = L\langle y - y_{\mathrm{prj}}, \nabla f^*(y) - \nabla f^*(y_{\mathrm{prj}})\rangle \ge \|y - y_{\mathrm{prj}}\|^2 = \|p - q\|^2,$$

where the inequality follows from the definition of RSC. □

3 Main results 

This section derives the complexity bounds for the ordinary and accelerated gradient methods under RLG 
and/or RSC conditions; the derived complexities are summarized in Table 2. The bounds are presented for 
the following error quantities: 

1. Objective error: $\Delta_k := f(x^{(k)}) - f^*$, where $f^* = \min_{x \in \mathbb{R}^n} f(x)$;

2. Solution error: $r_k := \|x^{(k)} - x^{(k)}_{\mathrm{prj}}\| = \min\{\|x^{(k)} - x^*\| : x^* \in X^*\}$.

Table 2: Complexities of the new classes of functions

3.1 Ordinary gradient descent



Algorithm 1 Ordinary gradient descent method

Input: Initialize $x^{(0)} \in \mathbb{R}^n$ and select stepsize $h > 0$.

1: for $k = 0, 1, \cdots$, do

2: $x^{(k+1)} = x^{(k)} - h\nabla f(x^{(k)})$;

3: end for



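Theorem 3 (Sublinear convergence for $\mathcal{L}_R(\mathbb{R}^n)$). Assume that in problem (1), $f \in \mathcal{L}_R(\mathbb{R}^n)$ with $R > 0$.
Then Algorithm 1 with stepsize $h \in (0, 1/R]$ converges sublinearly with

$$\Delta_k = O\Big(\frac{R r_0^2}{k}\Big),$$

where $r_0 = \|x^{(0)} - x^{(0)}_{\mathrm{prj}}\|$. It reaches $\epsilon$-accuracy (i.e., $\Delta_k \le \epsilon$) in $O(R/\epsilon)$ iterations.

Proof. Firstly, we prove that $r_k$ is non-increasing and thus uniformly bounded by $r_0$. From part 2) of Lemma
1 and $h = a/R$, where $a \in (0, 1]$, we have

$$h^2\|\nabla f(x^{(k)})\|^2 = 2ah\frac{1}{2R}\|\nabla f(x^{(k)})\|^2 \le 2ah\langle \nabla f(x^{(k)}), x^{(k)} - x^{(k)}_{\mathrm{prj}}\rangle \le 2h\langle \nabla f(x^{(k)}), x^{(k)} - x^{(k)}_{\mathrm{prj}}\rangle,$$

so in turn we get from $x^{(k+1)} = x^{(k)} - h\nabla f(x^{(k)})$ that

\begin{align}
r_{k+1}^2 = \|x^{(k+1)} - x^{(k+1)}_{\mathrm{prj}}\|^2 &\le \|x^{(k+1)} - x^{(k)}_{\mathrm{prj}}\|^2 \tag{23a} \\
&= \|x^{(k)} - x^{(k)}_{\mathrm{prj}} - h\nabla f(x^{(k)})\|^2 \tag{23b} \\
&= \|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 - 2h\langle \nabla f(x^{(k)}), x^{(k)} - x^{(k)}_{\mathrm{prj}}\rangle + h^2\|\nabla f(x^{(k)})\|^2 \le r_k^2 \tag{23c}
\end{align}

and $r_k \le r_0$, $\forall k$.

Next, by the convexity of $f$, $\langle \nabla f(x^{(k)}), x^{(k)} - x^*\rangle \ge f(x^{(k)}) - f^* \ge 0$. Since $r_k \le r_0$, we have the bound

$$\|\nabla f(x^{(k)})\| \ge \frac{r_k}{r_0}\|\nabla f(x^{(k)})\| \ge \frac{1}{r_0}|\langle \nabla f(x^{(k)}), x^{(k)} - x^{(k)}_{\mathrm{prj}}\rangle| \ge \frac{\Delta_k}{r_0}.$$

By part 1) of Lemma 1 we have

$$\begin{aligned}
\Delta_{k+1} &\le \Delta_k + \langle \nabla f(x^{(k)}), x^{(k+1)} - x^{(k)}\rangle + \frac{R}{2}\|x^{(k+1)} - x^{(k)}\|^2 \\
&= \Delta_k - h\Big(1 - \frac{hR}{2}\Big)\|\nabla f(x^{(k)})\|^2 \\
&\le \Delta_k - \frac{h}{r_0^2}\Big(1 - \frac{hR}{2}\Big)\Delta_k^2.
\end{aligned}$$

For $h = a/R$, where $0 < a \le 1$, $\frac{h}{r_0^2}(1 - \frac{hR}{2}) = \frac{a(2 - a)}{2Rr_0^2} = O(\frac{1}{Rr_0^2})$. Dividing both sides of $\Delta_{k+1} \le
\Delta_k - O(\frac{1}{Rr_0^2})\Delta_k^2$ by $\Delta_k\Delta_{k+1}$, we get $(1/\Delta_{k+1}) \ge (1/\Delta_k) + O(\frac{1}{Rr_0^2})$. Therefore, $\Delta_k = O(Rr_0^2/k)$, following
from which $\Delta_k \le \epsilon$ is guaranteed in $O(Rr_0^2/\epsilon) = O(R/\epsilon)$ iterations. □

(Restricted) Lipschitz continuity of $\nabla f$ alone cannot provide a decay rate for $r_k$. In fact, $r_k$ can decay
arbitrarily slowly as function $f$ becomes arbitrarily close to being flat near its minimizer. With the additional
RSC assumptions, the theorems below give geometrically-decaying bounds for both $r_k$ and $\Delta_k$.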

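Theorem 4 (Linear convergence for $\mathcal{R}_{R,\nu}$). Assume that in problem (1), $f \in \mathcal{R}_{R,\nu}(\mathbb{R}^n)$ with some $R, \nu > 0$.
Then Algorithm 1 with stepsize $h = \frac{1}{2R}$ converges linearly with

$$r_{k+1} \le \Big(1 - \frac{\nu}{2R}\Big)^{1/2} r_k,$$

$$\Delta_k \le \frac{R}{2} r_0^2 \Big(1 - \frac{\nu}{2R}\Big)^k.$$

It reaches $\epsilon$-accuracy in $O(\frac{R}{\nu}\log\frac{1}{\epsilon})$ iterations.

Conversely, assuming that $f$ has the unique solution $x^*$ and Algorithm 1 starts from an arbitrary point and has a
finite stepsize $h$, linear convergence in the form of $\|x^{(k+1)} - x^*\|^2 \le (1 - \delta)\|x^{(k)} - x^*\|^2$ for some $0 < \delta < 1$
requires $f$ to be RSC($\nu$) for some $\nu > 0$.

Proof. Recall that $x^{(k)}_{\mathrm{prj}}$ is the projection of $x^{(k)}$ onto the solution set $X^*$ and $r_k = \|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|$. Thus,
$\nabla f(x^{(k)}_{\mathrm{prj}}) = 0$. For every $\theta \in [0, 1]$ we have

\begin{align}
\|x^{(k+1)} - x^{(k+1)}_{\mathrm{prj}}\|^2 &\le \|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 - 2h\langle \nabla f(x^{(k)}), x^{(k)} - x^{(k)}_{\mathrm{prj}}\rangle + h^2\|\nabla f(x^{(k)}) - \nabla f(x^{(k)}_{\mathrm{prj}})\|^2 \tag{24a} \\
&\le \|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 - 2h\Big(\frac{\theta}{2R}\|\nabla f(x^{(k)}) - \nabla f(x^{(k)}_{\mathrm{prj}})\|^2 + (1 - \theta)\nu\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2\Big) \notag \\
&\quad + h^2\|\nabla f(x^{(k)}) - \nabla f(x^{(k)}_{\mathrm{prj}})\|^2 \tag{24b} \\
&= (1 - 2h(1 - \theta)\nu)\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 + \Big(h^2 - \frac{h\theta}{R}\Big)\|\nabla f(x^{(k)}) - \nabla f(x^{(k)}_{\mathrm{prj}})\|^2, \tag{24c}
\end{align}

where inequality (24a) follows from (23) and inequality (24b) utilizes (11). We minimize (24c) over $\theta$ and $h$
and obtain $\theta = \frac{1}{2}$ and $h = \frac{1}{2R}$; the details can be found in Appendix. Then from (24c) we get

$$\|x^{(k+1)} - x^{(k+1)}_{\mathrm{prj}}\|^2 \le \Big(1 - \frac{\nu}{2R}\Big)\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2, \tag{25}$$

i.e., $r_{k+1} \le (1 - \frac{\nu}{2R})^{1/2} r_k$.

By part 1) of Lemma 1, $\nabla f(x^{(k)}_{\mathrm{prj}}) = 0$, and $r_{k+1} \le (1 - \frac{\nu}{2R})^{1/2} r_k$, we derive that

$$\Delta_k = f(x^{(k)}) - f^* \le \frac{R}{2}\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 = \frac{R}{2}r_k^2 \le \frac{R}{2}r_0^2\Big(1 - \frac{\nu}{2R}\Big)^k, \tag{26}$$

which shows $\Delta_k \le \frac{R}{2}r_0^2(1 - \frac{\nu}{2R})^k$, following from which $\Delta_k \le \epsilon$ is guaranteed in $O(\frac{R}{\nu}\log\frac{1}{\epsilon})$ iterations.

Now, we show the converse result. Since $f$ has the unique solution $x^*$, we have $x^{(k)}_{\mathrm{prj}} = x^{(k+1)}_{\mathrm{prj}} = x^*$.
Noticing $x^{(k+1)} = x^{(k)} - h\nabla f(x^{(k)})$, we get

$$\|x^{(k+1)} - x^*\|^2 = \|x^{(k)} - x^*\|^2 - 2h\langle \nabla f(x^{(k)}), x^{(k)} - x^*\rangle + h^2\|\nabla f(x^{(k)}) - \nabla f(x^*)\|^2.$$

From $\|x^{(k+1)} - x^*\|^2 \le (1 - \delta)\|x^{(k)} - x^*\|^2$ for some $0 < \delta < 1$, we have

$$h^2\|\nabla f(x^{(k)}) - \nabla f(x^*)\|^2 - 2h\langle \nabla f(x^{(k)}), x^{(k)} - x^*\rangle \le -\delta\|x^{(k)} - x^*\|^2,$$

and consequently $\langle \nabla f(x^{(k)}), x^{(k)} - x^*\rangle \ge \frac{\delta}{2h}\|x^{(k)} - x^*\|^2$ after dropping $h^2\|\nabla f(x^{(k)}) - \nabla f(x^*)\|^2 \ge 0$. As
$x^{(k)}$ is arbitrary, $f$ is RSC($\nu$) with $\nu = \frac{\delta}{2h} > 0$. □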
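
The linear rate of Theorem 4 can be observed on a small instance. The sketch below is an illustration under stated assumptions, not the authors' experiment: it uses the componentwise version of the function from Example 3 with a hypothetical threshold `BETA = 1`, for which the gradient is 1-Lipschitz ($R = 1$) and RSI holds with $\nu = 1$, and runs fixed-step gradient descent with $h = 1/(2R)$ from a hypothetical starting point. The recorded distances to the solution set should contract at least as fast as $(1 - \nu/(2R))^{1/2}$ per iteration.

```python
import math
from typing import List, Tuple

BETA = 1.0  # hypothetical shrinkage threshold
R = 1.0     # the gradient below is 1-Lipschitz
NU = 1.0    # RSI constant of 0.5 * ||shrink(x)||^2


def shrink(v: List[float]) -> List[float]:
    """Componentwise soft-thresholding with threshold BETA."""
    return [math.copysign(max(abs(t) - BETA, 0.0), t) for t in v]


def grad(v: List[float]) -> List[float]:
    # The gradient of 0.5 * ||shrink(v)||^2 is shrink(v).
    return shrink(v)


def dist_to_solution_set(v: List[float]) -> float:
    # X* is the box [-BETA, BETA]^n, and v - Proj(v) equals shrink(v).
    return math.sqrt(sum(s * s for s in shrink(v)))


def gradient_descent(x0: List[float], h: float, iters: int) -> Tuple[List[float], List[float]]:
    """Fixed-step gradient descent; also records r_k for k = 0..iters."""
    x = list(x0)
    r = [dist_to_solution_set(x)]
    for _ in range(iters):
        g = grad(x)
        x = [xi - h * gi for xi, gi in zip(x, g)]
        r.append(dist_to_solution_set(x))
    return x, r


x_final, r = gradient_descent([4.0, -3.0, 0.5], h=1.0 / (2.0 * R), iters=30)
rate = math.sqrt(1.0 - NU / (2.0 * R))  # contraction factor from Theorem 4
```

On this instance the observed contraction factor is 1/2, which is better than the guaranteed $\sqrt{1/2}$; the theorem gives a worst-case bound.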
If RLG is strengthened to global Lipschitz continuity, we can take a possibly larger stepsize $1/L$ instead
of $1/(2R)$ and have possibly better constants in the bound as follows.

Theorem 5 (Linear convergence for $\mathcal{R}_{L,\nu}$). Assume that in problem (1), $\nabla f$ is $L$-Lipschitz continuous and
$f$ is RSC($\nu$) with $L, \nu > 0$. Then Algorithm 1 with stepsize $h = 1/L$ converges linearly with

$$r_{k+1} \le (1 - \nu/L)^{1/2} r_k,$$

$$\Delta_k \le \frac{L}{2} r_0^2 (1 - \nu/L)^k.$$

It reaches $\epsilon$-accuracy in $O(\frac{L}{\nu}\log\frac{1}{\epsilon})$ iterations.

Proof. By replacing Lemma 1 with the following two lemmas and repeating the arguments in Theorem 4,
the desired linear convergence rates can be derived. □

Lemma 4 ([7] Theorem 2.1.5). If $f(x) \in \mathcal{F}_L(\mathbb{R}^n)$, it obeys

$$f(x) \le f(y) + \langle \nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^n; \tag{27}$$

$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2, \quad \forall x, y \in \mathbb{R}^n. \tag{28}$$

Lemma 5. Let $X^*$ be the nonempty solution set of (1). If $\nabla f$ is $L$-Lipschitz continuous and $f$ is RSC($\nu$)
with $L, \nu > 0$, then for every $\theta \in [0, 1]$ the following holds:

$$\langle \nabla f(x) - \nabla f(x_{\mathrm{prj}}), x - x_{\mathrm{prj}}\rangle \ge \frac{\theta}{L}\|\nabla f(x) - \nabla f(x_{\mathrm{prj}})\|^2 + (1 - \theta)\nu\|x - x_{\mathrm{prj}}\|^2, \tag{29}$$

where $x_{\mathrm{prj}}$ denotes the projection of $x$ onto the solution set $X^*$.

Proof. Inequality (29) follows from inequalities (7) and (28). □



3.2 Accelerated gradient descent 



Algorithm 2 Nesterov's accelerated gradient method

Input: Initialization $y^{(0)} \in \mathbb{R}^n$, $\theta_0 = 1$, and $h > 0$.
1: for $k = 0, 1, \cdots$, do

2: $x^{(k+1)} = y^{(k)} - h\nabla f(y^{(k)})$; (negative gradient step)

3: $\beta_{k+1} = (1 - \theta_k)(\sqrt{\theta_k^2 + 4} - \theta_k)/2$; (extrapolation weight)

4: $y^{(k+1)} = x^{(k+1)} + \beta_{k+1}(x^{(k+1)} - x^{(k)})$; (extrapolation)

5: $\theta_{k+1} = \theta_k(\sqrt{\theta_k^2 + 4} - \theta_k)/2$; (dampening of acceleration parameter)

6: end for
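
The steps of Algorithm 2 can be sketched as follows. This is a minimal illustration, assuming $x^{(0)} = y^{(0)}$ (an assumption that is harmless here because $\theta_0 = 1$ gives $\beta_1 = 0$) and a hypothetical ill-conditioned quadratic test problem with $R = 1$; the function and variable names are illustrative. The sketch records the iterates and the sequence $\theta_k$ so that the standard $1/k^2$ decay of the objective error can be inspected.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def nesterov(grad: Callable[[Vector], Vector], y0: Vector, h: float,
             iters: int) -> Tuple[List[Vector], List[float]]:
    """Accelerated gradient method; returns x^(0..iters) and theta_0..theta_iters."""
    x_prev = list(y0)  # assumption: x^(0) = y^(0)
    y = list(y0)
    theta = 1.0
    xs = [x_prev]
    thetas = [theta]
    for _ in range(iters):
        g = grad(y)
        x = [yi - h * gi for yi, gi in zip(y, g)]                   # step 2
        root = math.sqrt(theta * theta + 4.0)
        beta = (1.0 - theta) * (root - theta) / 2.0                 # step 3
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]    # step 4
        theta = theta * (root - theta) / 2.0                        # step 5
        x_prev = x
        xs.append(x)
        thetas.append(theta)
    return xs, thetas


# Hypothetical test problem: f(x) = 0.5 * (x_1^2 + 0.01 * x_2^2), minimum value 0.
D = [1.0, 0.01]
R = 1.0  # Lipschitz constant of the gradient


def f(x: Vector) -> float:
    return 0.5 * sum(d * t * t for d, t in zip(D, x))


def grad_f(x: Vector) -> Vector:
    return [d * t for d, t in zip(D, x)]


x0 = [1.0, 1.0]
xs, thetas = nesterov(grad_f, x0, 1.0 / R, 200)
r0_sq = sum(t * t for t in x0)  # squared distance from x0 to the minimizer 0
```

Since the update of $\theta_k$ satisfies $\theta_{k+1}^2 = (1 - \theta_{k+1})\theta_k^2$, the sequence decreases like $2/(k+2)$, which is what drives the $O(1/k^2)$ objective decay.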



Algorithm 2 is equivalent to Constant Step Scheme II on Page 80 of [7] (their $\alpha_k$ = our $\theta_k$ and their $q = 0$) and
FISTA on Page 193 of [2] without the nonsmooth regularization function $g$ (their $t_k = 1/\theta_k$).

Theorem 6. Assume that in problem (1), $f \in \mathcal{L}_R(\mathbb{R}^n)$ with $R > 0$. Then Algorithm 2 with $h = 1/R$
converges sublinearly with

$$\Delta_k \le \frac{4R}{(k+1)^2}\|x^{(0)} - x^*\|^2. \tag{30}$$

It reaches $\epsilon$-accuracy in $O(\sqrt{R/\epsilon})$ iterations.

1 Step 5 of Algorithm 2 satisfies $\theta_{k+1}^2 = (1 - \theta_{k+1})\theta_k^2$; plugging $\theta_k = 1/t_k$ and $\theta_{k+1} = 1/t_{k+1}$, we obtain $t_k^2 = (t_{k+1} - 1)t_{k+1}$,
which gives step (4.2) in [2]. Also, $\beta_{k+1}$ equals $(t_k - 1)/t_{k+1}$ in (4.3).

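The proof below is self-contained and inspired by [15]. Its $O(\sqrt{R/\epsilon})$ is better than $O(R/\epsilon)$ of Theorem 3.
Proof. Sequences $\{\theta_k\}$ and $\{\beta_k\}$ obey the following recursive relationships:

$$\frac{1 - \theta_{k+1}}{\theta_{k+1}^2} = \frac{1}{\theta_k^2}, \qquad \beta_{k+1} = \theta_{k+1}\Big(\frac{1}{\theta_k} - 1\Big).$$

Defining $v^{(0)} = x^{(0)}$ and $v^{(k+1)} = x^{(k)} + \frac{1}{\theta_k}(x^{(k+1)} - x^{(k)})$, we can rewrite $y^{(k+1)} = \theta_{k+1}v^{(k+1)} + (1 - \theta_{k+1})x^{(k+1)}$.

From part 1) of Lemma 1 and the convexity of $f$, for any $z \in \mathbb{R}^n$ we have

$$\begin{aligned}
f(x^{(k+1)}) &\le f(y^{(k)}) + \langle \nabla f(y^{(k)}), x^{(k+1)} - y^{(k)}\rangle + \frac{R}{2}\|x^{(k+1)} - y^{(k)}\|^2 \\
&\le \big(f(z) + \langle \nabla f(y^{(k)}), y^{(k)} - z\rangle\big) + \langle \nabla f(y^{(k)}), x^{(k+1)} - y^{(k)}\rangle + \frac{R}{2}\|x^{(k+1)} - y^{(k)}\|^2 \\
&= f(z) + \langle \nabla f(y^{(k)}), x^{(k+1)} - z\rangle + \frac{R}{2}\|x^{(k+1)} - y^{(k)}\|^2 \\
&= f(z) + R\langle x^{(k+1)} - y^{(k)}, z - x^{(k+1)}\rangle + \frac{R}{2}\|x^{(k+1)} - y^{(k)}\|^2.
\end{aligned}$$

Setting $z = \theta_k x^* + (1 - \theta_k)x^{(k)}$, where $x^* \in X^*$, and using the convexity of $f$, we get

$$f(x^{(k+1)}) \le \theta_k f^* + (1 - \theta_k)f(x^{(k)}) + R\langle x^{(k+1)} - y^{(k)}, \theta_k x^* + (1 - \theta_k)x^{(k)} - x^{(k+1)}\rangle + \frac{R}{2}\|x^{(k+1)} - y^{(k)}\|^2. \tag{32}$$

Since $\theta_k x^* + (1 - \theta_k)x^{(k)} - x^{(k+1)} = \theta_k(x^* - v^{(k+1)})$ and $x^{(k+1)} - y^{(k)} = \theta_k(v^{(k+1)} - v^{(k)})$, we have

$$\begin{aligned}
R\langle x^{(k+1)} - y^{(k)}, \theta_k x^* + (1 - \theta_k)x^{(k)} - x^{(k+1)}\rangle &= R\theta_k^2\langle v^{(k+1)} - v^{(k)}, x^* - v^{(k+1)}\rangle \\
&= R\theta_k^2\langle v^{(k+1)} - x^*, v^{(k)} - x^*\rangle - R\theta_k^2\|v^{(k+1)} - x^*\|^2
\end{aligned}$$

and

$$\frac{R}{2}\|x^{(k+1)} - y^{(k)}\|^2 = \frac{R\theta_k^2}{2}\big(\|v^{(k+1)} - x^*\|^2 + \|v^{(k)} - x^*\|^2 - 2\langle v^{(k+1)} - x^*, v^{(k)} - x^*\rangle\big). \tag{33}$$

Substituting these equations into the last two terms of (32), we get

$$f(x^{(k+1)}) \le \theta_k f^* + (1 - \theta_k)f(x^{(k)}) - \frac{R\theta_k^2}{2}\|v^{(k+1)} - x^*\|^2 + \frac{R\theta_k^2}{2}\|v^{(k)} - x^*\|^2. \tag{34}$$

Reordering the terms and dividing by $\theta_k^2$ and then recursively deducing, we have

\begin{align}
\frac{1}{\theta_k^2}(f(x^{(k+1)}) - f^*) + \frac{R}{2}\|v^{(k+1)} - x^*\|^2 &\le \frac{1 - \theta_k}{\theta_k^2}(f(x^{(k)}) - f^*) + \frac{R}{2}\|v^{(k)} - x^*\|^2 \tag{35a} \\
&= \frac{1}{\theta_{k-1}^2}(f(x^{(k)}) - f^*) + \frac{R}{2}\|v^{(k)} - x^*\|^2 \tag{35b} \\
&\le \cdots \le f(x^{(0)}) - f^* + \frac{R}{2}\|v^{(0)} - x^*\|^2, \tag{35c}
\end{align}

where the last inequality follows from $\theta_0 = 1$. Since $v^{(0)} = x^{(0)}$ and $f(x^{(0)}) - f^* \le \frac{R}{2}\|x^{(0)} - x^*\|^2$ from part
1) of Lemma 1, we finally obtain

$$f(x^{(k+1)}) - f^* \le R\theta_k^2\|x^{(0)} - x^*\|^2 \le \frac{4R}{(k+2)^2}\|x^{(0)} - x^*\|^2. \tag{36}$$

Finally, we derive $\theta_k \le \frac{2}{k+2}$ for $k = 0, 1, 2, \ldots$, from which the sublinear convergence rate (30) and its
corresponding complexity will follow. From $\theta_0 = 1$ and Step 5 of Algorithm 2, we have $\theta_k > 0$. From
$\sqrt{\theta_k^2 + 4} \ge 2$ and Step 5 again, we have $\frac{1}{\theta_{k+1}} \ge \frac{2 + \theta_k}{2\theta_k}$ and thus $\frac{1}{\theta_{k+1}} - 1 \ge \frac{1}{\theta_k} - \frac{1}{2} = (\frac{1}{\theta_k} - 1) + \frac{1}{2}$.
Hence, for all $k \ge 0$, we have $\frac{1}{\theta_k} - 1 \ge \frac{k}{2}$ or $\theta_k \le \frac{2}{k+2}$. □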
Algorithm 3 Algorithm 2 with restarts

Input: Initialization $y^{(0,0)} \in \mathbb{R}^n$, $\theta_0 = 1$, restart interval $K$.
1: for $j = 0, 1, \cdots$, do

2: obtain $x^{(j,K)}$ by running Algorithm 2 for $K$ iterations;
3: set $x^{(j+1,0)} = x^{(j,K)}$, $y^{(j+1,0)} = x^{(j,K)}$ and $\theta_0 = 1$;
4: end for
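
A fixed-interval restart scheme of this kind can be sketched as below. The sketch assumes a hypothetical quadratic test problem with $R = 1$ and $\nu = 0.01$ and takes the restart interval as $K = \lceil\sqrt{8eR/\nu}\,\rceil$; rounding up to an integer is an assumption made here so that $K$ can serve as an iteration count. All names are illustrative. Each outer round runs the accelerated method for $K$ iterations from the previous output with $\theta$ reset to 1, and the objective error after every round is recorded.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def accelerated_run(grad: Callable[[Vector], Vector], x0: Vector, h: float,
                    iters: int) -> Vector:
    """One pass of the accelerated method started at y = x0 with theta = 1."""
    x_prev = list(x0)
    y = list(x0)
    theta = 1.0
    for _ in range(iters):
        g = grad(y)
        x = [yi - h * gi for yi, gi in zip(y, g)]
        root = math.sqrt(theta * theta + 4.0)
        beta = (1.0 - theta) * (root - theta) / 2.0
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        theta = theta * (root - theta) / 2.0
        x_prev = x
    return x_prev


def restarted(grad: Callable[[Vector], Vector], f: Callable[[Vector], float],
              x0: Vector, h: float, K: int, rounds: int) -> Tuple[Vector, List[float]]:
    """Restart the accelerated method every K iterations; record f after each round."""
    x = list(x0)
    errs = [f(x)]
    for _ in range(rounds):
        x = accelerated_run(grad, x, h, K)  # theta is reset inside each pass
        errs.append(f(x))
    return x, errs


# Hypothetical test problem with minimum value 0, R = 1 and nu = 0.01.
D = [1.0, 0.01]
R, NU = 1.0, 0.01


def f(x: Vector) -> float:
    return 0.5 * sum(d * t * t for d, t in zip(D, x))


def grad_f(x: Vector) -> Vector:
    return [d * t for d, t in zip(D, x)]


K = math.ceil(math.sqrt(8.0 * math.e * R / NU))
x_out, errs = restarted(grad_f, f, [1.0, 1.0], 1.0 / R, K, rounds=8)
```

With this choice of $K$, each round is expected to shrink the objective error by at least a factor of $e$, so the number of rounds needed grows only logarithmically in $1/\epsilon$.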



Theorem 7. Assume that in problem (1), $f \in \mathcal{R}_{R,\nu}(\mathbb{R}^n)$ with some $R > 0$, $\nu > 0$. Then Algorithm 3 with
$h = 1/R$ and $K = \sqrt{8eR/\nu}$ reaches $\epsilon$-accuracy in $O(\sqrt{R/\nu}\,\log\frac{1}{\epsilon})$ iterations.

Proof. At iteration $j$ of Algorithm 3, we have

$$f(x^{(j+1,0)}) - f^* = f(x^{(j,K)}) - f^* \le \frac{4R\|x^{(j,0)} - x^{(j,0)}_{\mathrm{prj}}\|^2}{K^2} \le \frac{8R}{\nu K^2}(f(x^{(j,0)}) - f^*), \tag{37}$$

where the first inequality follows from the convergence guarantee (30) of Algorithm 2 and the second from
Lemma 3. After $jK$ iterations, by the setting of $K = \sqrt{8eR/\nu}$ we have

$$f(x^{(j,0)}) - f^* \le \Big(\frac{8R}{\nu K^2}\Big)^j (f(x^{(0,0)}) - f^*) = e^{-j}(f(x^{(0,0)}) - f^*). \tag{38}$$

Thus, to obtain an $\epsilon$-solution, we only need to take $j = O(\log(1/\epsilon))$ and hence the total number of iterations
$jK = O(\sqrt{R/\nu}\,\log\frac{1}{\epsilon})$, which completes the proof. □

The above result and proof were motivated by [£j. Compared to [9] and [10], we use weaker conditions.

4 Application to augmented $\ell_1$ minimization

4.1 An improved convergence rate

The augmented $\ell_1$ model (19) returns an exact solution to

$$\min\{\|x\|_1 : Ax = b\} \tag{39}$$

provided that $\alpha$ in (19) is large enough. For most problems where a sparse solution $x^*$ is expected from
(39), such as those arising in compressive sensing, paper [4] argues that $\alpha = 10\|x^*\|_\infty$ is sufficient. The
Lagrange dual of (19), which is problem (20), has an unconstrained and differentiable objective function.
By Example 5, the negative of the dual objective function, $-f(y)$, satisfies RSC. In addition, $f$ has an
$L$-Lipschitz continuous gradient $\nabla f$ with $L = \alpha\|A\|^2$. Therefore, we can apply Theorems 5 and 7 to the
ordinary and accelerated gradient iterations for (20).

The gradient ascent iteration for (20) is known as the linearized Bregman algorithm (LBreg):

$$x^{(k+1)} \leftarrow \alpha\,\mathrm{shrink}(A^T y^{(k)}), \tag{40a}$$
$$y^{(k+1)} \leftarrow y^{(k)} + h(b - Ax^{(k+1)}), \tag{40b}$$

where $x^{(k)}$ and $y^{(k)}$ are the primal and dual variables at iteration $k$ and $h > 0$ is the step size. One can
verify that $(b - Ax^{(k+1)})$ is the gradient of the objective of (20). The solution set is given by

$$Y^* = \{y \in \mathbb{R}^m : b - \alpha A\,\mathrm{shrink}(A^T y) = 0\} = \{y \in \mathbb{R}^m : \alpha\,\mathrm{shrink}(A^T y) = x^*\}, \tag{41}$$

where $x^*$ is assumed to be the unique solution to (19); the derivation can be found in [3].
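
A dual gradient ascent iteration of this type can be sketched in a few lines. The example below is illustrative only: the $2 \times 3$ matrix `A`, the right-hand side `b`, and $\alpha = 10$ are hypothetical data chosen so that the sparse solution is $(0, 0, 1)$, and the step size uses the squared Frobenius norm of $A$ as a safe upper bound on $\|A\|^2$, which is an assumption made here to avoid an eigenvalue computation. The primal variable is obtained by shrinkage with threshold 1 scaled by $\alpha$, and the dual variable moves along the residual $b - Ax$.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def shrink1(v: Vector) -> Vector:
    """Componentwise soft-thresholding with threshold 1."""
    return [math.copysign(max(abs(t) - 1.0, 0.0), t) for t in v]


def matvec(A: Matrix, x: Vector) -> Vector:
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def rmatvec(A: Matrix, y: Vector) -> Vector:
    """Compute A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def linearized_bregman(A: Matrix, b: Vector, alpha: float, h: float,
                       iters: int) -> Tuple[Vector, Vector]:
    """Dual gradient ascent: x = alpha * shrink(A^T y), then y += h * (b - A x)."""
    y = [0.0] * len(A)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        x = [alpha * s for s in shrink1(rmatvec(A, y))]
        Ax = matvec(A, x)
        y = [yi + h * (bi - ai) for yi, bi, ai in zip(y, b, Ax)]
    return x, y


# Hypothetical data: b = A @ (0, 0, 1), whose minimum l1-norm solution is (0, 0, 1).
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [1.0, 1.0]
alpha = 10.0
frob_sq = sum(a * a for row in A for a in row)  # upper bound on ||A||^2
h = 1.0 / (alpha * frob_sq)
x, y = linearized_bregman(A, b, alpha, h, iters=300)
residual = [bi - ai for bi, ai in zip(b, matvec(A, x))]
```

Using an upper bound on $\|A\|^2$ gives a smaller step than $1/L$; this keeps the iteration stable at the cost of somewhat slower progress.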
Paper [4] shows 




Applying Theorem 5, we obtain a tighter convergence bound:

Theorem 8. In problem (20), assume that $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are nonzero and $Ax = b$ is consistent.
Let $f^*$ be the optimal objective value of (20). The linearized Bregman iteration starting from any
$y^{(0)} \in \mathbb{R}^m$ with step size $h_k = \frac{1}{L}$ generates a Q-linearly converging sequence $\{y^{(k)}\}$:

$$\|y^{(k)} - y^{(k)}_{\mathrm{prj}}\| \le \sqrt{1 - \nu/L}\,\|y^{(k-1)} - y^{(k-1)}_{\mathrm{prj}}\|, \quad \forall k \ge 1. \tag{42}$$

The objective value converges R-linearly as

$$f^* - f(y^{(k)}) \le \frac{L}{2}\|y^{(0)} - y^{(0)}_{\mathrm{prj}}\|^2\Big(1 - \frac{\nu}{L}\Big)^k, \quad \forall k \ge 1. \tag{43}$$

Furthermore, $x^{(k)}$ converges R-linearly as




where $x^*$ is the solution to (19). The results are in the global sense.

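Proof. Due to (40a), (41), the expression $\nabla f(y) = b - \alpha A\,\mathrm{shrink}(A^T y)$, and the Lipschitz property of
$\nabla f(y)$, we have

\begin{align}
\|x^{(k+1)} - x^*\| &= \|\alpha\,\mathrm{shrink}(A^T y^{(k)}) - \alpha\,\mathrm{shrink}(A^T y^{(k)}_{\mathrm{prj}})\| \tag{45a} \\
&= \|\nabla f(y^{(k)}) - \nabla f(y^{(k)}_{\mathrm{prj}})\| \tag{45b} \\
&\le L\|y^{(k)} - y^{(k)}_{\mathrm{prj}}\|, \tag{45c}
\end{align}

which gives (44). The remaining results follow from Theorem 5 applied to $-f$. □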

4.2 Numerical simulation 

To demonstrate the convergence results, we compared the following algorithms for problem (20): 

1. fixed-step gradient ascent (Algorithm 1); 

2. gradient ascent with Nesterov's acceleration (Algorithm 2 [3]); 

3. Nesterov's acceleration with restart (Algorithm 3 with restart); 

4. Nesterov's acceleration with skip (Algorithm 3 with skip). 

Although for (20) we can compute $K = \sqrt{8eL/\nu}$ using the lower bound of $\nu$ given in Example 5 and thus 
run Algorithm 3 with restart every $K$ iterations, such $K$ was found too large. Instead, we ran Algorithm 4, 
which uses the following scheme to trigger restart as suggested in [10] (the inequality is given in the opposite 
direction for concave maximization): 

Gradient scheme: $\nabla f(y^{(k-1)})^T (y^{(k)} - y^{(k-1)}) < 0$. 

We also introduce the skip heuristic: set $\beta_{k+1} = 0$ (and make no change to $\theta_k$). 
Algorithm 4 Nesterov's accelerated gradient method with reset 

Input: Initialization $y^{(0)} \in \mathbb{R}^n$, $\theta_0 = 1$, and $h > 0$. 
1: for $k = 0, 1, \ldots$, do 
2:   $x^{(k+1)} = y^{(k)} - h\nabla f(y^{(k)})$; (negative gradient step) 
3:   if restart then 
4:     $\theta_k = 1$ and $\beta_{k+1} = 0$; 
5:   elseif skip then 
6:     $\beta_{k+1} = 0$; 
7:   else 
8:     $\beta_{k+1} = (1 - \theta_k)\left(\sqrt{\theta_k^2 + 4} - \theta_k\right)/2$; (extrapolation weight) 
9:   end if 
10:  $y^{(k+1)} = x^{(k+1)} + \beta_{k+1}(x^{(k+1)} - x^{(k)})$; (extrapolation) 
11:  $\theta_{k+1} = \theta_k\left(\sqrt{\theta_k^2 + 4} - \theta_k\right)/2$; (dampening of acceleration parameter) 
12: end for 
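The steps of Algorithm 4 can be sketched in Python as follows. This is an illustrative sketch rather than the code used for the experiments: the function name `accelerated_gradient` and its `mode` argument are hypothetical, the routine is written for minimization (so the gradient-scheme inequality is flipped relative to the concave-maximization form), and the test function is an assumed toy quadratic.

```python
import math

def accelerated_gradient(grad, y0, h, iters, mode="restart"):
    """Sketch of Algorithm 4 for minimization.

    mode is "restart", "skip", or "none" (plain acceleration).
    """
    y = list(y0)
    x = list(y0)                  # x^(0); unused at k = 0 because beta_1 = 0
    theta = 1.0                   # theta_0
    y_prev = g_prev = None
    for _ in range(iters):
        g = grad(y)
        x_new = [yi - h * gi for yi, gi in zip(y, g)]             # step 2
        # Gradient scheme (minimization form): the last move in y went uphill.
        fired = g_prev is not None and sum(
            gp * (yi - yp) for gp, yi, yp in zip(g_prev, y, y_prev)) > 0.0
        if fired and mode == "restart":
            theta, beta = 1.0, 0.0                                # step 4
        elif fired and mode == "skip":
            beta = 0.0                                            # step 6
        else:
            beta = (1.0 - theta) * (math.sqrt(theta * theta + 4.0) - theta) / 2.0  # step 8
        y_prev, g_prev = y, g
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]   # step 10
        theta = theta * (math.sqrt(theta * theta + 4.0) - theta) / 2.0  # step 11
        x = x_new
    return x

# Assumed toy problem: f(v) = 0.5 * (v1^2 + 10 * v2^2), step size h = 1/L = 0.1.
d = [1.0, 10.0]
grad = lambda v: [di * vi for di, vi in zip(d, v)]
f = lambda v: 0.5 * sum(di * vi * vi for di, vi in zip(d, v))
results = {}
for mode in ("none", "restart", "skip"):
    results[mode] = f(accelerated_gradient(grad, [1.0, 1.0], h=0.1, iters=500, mode=mode))
print(results)
```

Note that a restart resets $\theta_k$ to 1 before the dampening step, so the next extrapolation weight is computed from $\theta_{k+1} = (\sqrt{5} - 1)/2$, exactly as after a fresh start; a skip only zeroes the current weight.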



The comparisons use two examples. Each used a sparse signal $x^0$ with 512 entries, of which 25 were 
nonzero, either sampled independently from the standard Gaussian distribution (Test 1, Figure 4(a)) or 
set to $\pm 1$ uniformly at random (Test 2, Figure 4(b)). Both examples used the same sensing matrix $A$ 
with 256 rows and entries sampled independently from the standard Gaussian distribution. We used the 
following parameters: $b = Ax^0$, $\alpha = 10\|x^0\|_\infty$, and $h = \frac{1}{2L} = \frac{1}{2\alpha\|A\|^2}$. The iterations were stopped upon 
$\|Ax^{(k)} - b\| < 10^{-14}\|b\|$. Figure 4 depicts the relative error of $x^{(k)}$ versus iteration $k$. 
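A test instance of this kind could be generated along the following lines. The sketch is only an assumption about how such data might be built: the names `make_problem` and `relative_error` are hypothetical, the random seed is arbitrary, and the relative error is taken to be $\|x - x_{\mathrm{ref}}\| / \|x_{\mathrm{ref}}\|$, which is one common choice.

```python
import math
import random

def make_problem(n=512, m=256, k=25, kind="gaussian", seed=0):
    """Build (A, b, x0, alpha) for a sparse recovery test with k nonzeros."""
    rng = random.Random(seed)
    x0 = [0.0] * n
    for i in rng.sample(range(n), k):                 # k distinct support indices
        if kind == "gaussian":
            x0[i] = rng.gauss(0.0, 1.0)               # Test 1 style entries
        else:
            x0[i] = rng.choice((-1.0, 1.0))           # Test 2 style entries
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    b = [sum(a * v for a, v in zip(row, x0)) for row in A]   # b = A x0
    alpha = 10.0 * max(abs(v) for v in x0)            # alpha = 10 * ||x0||_inf
    return A, b, x0, alpha

def relative_error(x, x_ref):
    """Euclidean relative error ||x - x_ref|| / ||x_ref|| (assumed metric)."""
    num = math.sqrt(sum((a - c) ** 2 for a, c in zip(x, x_ref)))
    den = math.sqrt(sum(c * c for c in x_ref))
    return num / den

# A small instance, so the example runs quickly in pure Python.
A, b, x0, alpha = make_problem(n=40, m=20, k=5, kind="bernoulli", seed=1)
print(len(A), len(A[0]), alpha)
```

The step size $h$ additionally requires the spectral norm of $A$, which is left out of this sketch.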



[Figure 4, two panels: primal solution relative error (logarithmic scale) versus iteration, up to 1000 iterations, for fixed step, Nesterov, Nesterov + restart, and Nesterov + skip.] 

(a) Test 1: Gaussian sparse vector recovery 
(b) Test 2: Bernoulli sparse vector recovery 

Figure 4: Relative error of primal variable 



The fixed-step gradient iteration converged very slowly in Test 1, much more slowly than in Test 2; this can 
be explained by a smaller $\nu$ in Test 1 (see Lemma 7 of [3] for an explicit lower bound of $\nu$). The fixed-step 
iteration exhibited linear convergence in Test 2, though we cannot tell the same from Test 1. 

The accelerated gradient method performed similarly in both tests. Its performance was significantly 
improved in the second phase by restart and skip. In Test 1, skip was more effective. The two schemes did 
not appear to make much difference in these tests. It is interesting to note that in Test 2, both restart and 
skip had faster rates of convergence than the fixed-step gradient iteration; this deserves further tests and 
perhaps theoretical investigation. 

As the focus of this paper is not numerical simulation, we do not present more numerical results. For the 
interested reader, the source code can be found on the second author's homepage. 



5 Conclusions 

The convergence behavior of gradient methods on convex differentiable functions is one of the core questions 
in convex optimization. It is known to many researchers that global Lipschitz continuity of $\nabla f$ is more than 
sufficient for sublinear convergence, and that asking $f$ to be strongly convex is also too much for linear convergence. 
For the ordinary and accelerated gradient methods, this paper shows, using rather straightforward steps, that 
these conditions restricted to certain line segments are sufficient for the existing convergence results to hold. 
In addition, it shows that strong convexity restricted to the segment between the current point $x$ and its projection onto the 
solution set is also necessary for the geometric decay of solution error. 

For the accelerated gradient method to achieve the best worst-case bound $O\left(\sqrt{R/\nu}\,\log\frac{1}{\epsilon}\right)$ on (restricted) 
strongly convex functions, the modulus $\nu$ of the objective function must be given, which is not practical. It 
remains an open question how to design a method that attains this bound without requiring knowledge of $\nu$. On the other 
hand, the restart and skip heuristics appear to improve the performance of the accelerated method. 
Appendix 

We select the parameter $\theta$ and step size $h$ in (24c) to minimize the upper bound. Let $r = \frac{\nu}{2R} > 0$. As we 
need to deal with the second term in (24c), two cases are studied below depending on the sign of $h^2 - \frac{\theta h}{R}$. 

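Case A: $h^2 - \frac{\theta h}{R} \le 0$, i.e., $h \in (0, \frac{\theta}{R}]$, $\theta \in [0, 1]$. Applying the Cauchy-Schwarz inequality to RSI, we get 

$$\|\nabla f(x^{(k)}) - \nabla f(x^{(k)}_{\mathrm{prj}})\|^2 \ge \nu^2 \|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2. \qquad (46)$$
From $h^2 - \frac{\theta h}{R} \le 0$ and (24c), we derive that 

$$\|x^{(k+1)} - x^{(k+1)}_{\mathrm{prj}}\|^2 \le (1 - 2(1 - \theta)\nu h)\,\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 + \nu^2\left(h^2 - \frac{\theta h}{R}\right)\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 \qquad (47a)$$

$$= \left(\nu^2 h^2 - 2\left((1 - \theta)\nu + \frac{\theta\nu^2}{2R}\right)h + 1\right)\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 \qquad (47b)$$
$$\triangleq f_1(\theta, h)\,\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2. \qquad (47c)$$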

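Let $h_0 = \frac{1-\theta}{\nu} + \frac{\theta}{2R}$, which is the minimum point of the quadratic function $f_1(\theta, h)$ over variable $h$ for each 
fixed $\theta$. To determine whether such $h_0$ is included in the interval $(0, \frac{\theta}{R}]$, we consider $h_0 = \frac{1-\theta}{\nu} + \frac{\theta}{2R} = \frac{\theta}{R}$ 
and get $\theta = \frac{1}{1+r}$. Now, we split the interval $[0, 1]$ into $[\frac{1}{1+r}, 1]$ and $[0, \frac{1}{1+r})$. If $\theta \in [\frac{1}{1+r}, 1]$, we have $\frac{\theta}{R} \ge h_0$, 
which means the point $h_0 \in (0, \frac{\theta}{R}]$. Thus, 

$$\min_{h \le \frac{\theta}{R},\ \frac{1}{1+r} \le \theta \le 1} f_1(\theta, h) = \min_{\frac{1}{1+r} \le \theta \le 1} f_1(\theta, h_0) = \min_{\frac{1}{1+r} \le \theta \le 1} 1 - (1 - (1 + r)\theta)^2 = 1 - r^2,$$

where the minimum value $1 - r^2$ is obtained at $\theta = 1$ and $h = h_0 = \frac{1}{2R}$. If $\theta \in [0, \frac{1}{1+r})$, we have $\frac{\theta}{R} < h_0$, 
which means the point $h_0 \notin (0, \frac{\theta}{R}]$. By monotone decreasing of $f_1(\theta, h)$ on the interval $h \le \frac{\theta}{R}$ for each fixed 
$\theta$, we have 

$$\min_{h \le \frac{\theta}{R},\ 0 \le \theta < \frac{1}{1+r}} f_1(\theta, h) = \min_{0 \le \theta < \frac{1}{1+r}} f_1\left(\theta, \frac{\theta}{R}\right) = \min_{0 \le \theta < \frac{1}{1+r}} 1 - 4\theta(1 - \theta)r = 1 - r,$$

where the minimum value $1 - r$ is obtained at $\theta = \frac{1}{2}$ and $h = \frac{\theta}{R} = \frac{1}{2R}$; note that $\frac{1}{2} \in [0, \frac{1}{1+r})$ since $r < 1$. 

Therefore, on the intervals $h \in (0, \frac{\theta}{R}]$ and $\theta \in [0, 1]$, the minimum value $1 - r$ of $f_1(\theta, h)$ is obtained at 
$(\theta, h) = (\frac{1}{2}, \frac{1}{2R})$. 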

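Case B: $h^2 - \frac{\theta h}{R} \ge 0$, i.e., $h \in [\frac{\theta}{R}, +\infty)$, $\theta \in [0, 1]$. Applying the Cauchy-Schwarz inequality to part 2) 
of Lemma 1, we get 

$$\|\nabla f(x^{(k)}) - \nabla f(x^{(k)}_{\mathrm{prj}})\|^2 \le 4R^2 \|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2. \qquad (48)$$
From $h^2 - \frac{\theta h}{R} \ge 0$ and (24c), we derive that 

$$\|x^{(k+1)} - x^{(k+1)}_{\mathrm{prj}}\|^2 \le (1 - 2(1 - \theta)\nu h)\,\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 + 4R^2\left(h^2 - \frac{\theta h}{R}\right)\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 \qquad (49a)$$

$$= \left(4R^2 h^2 - 2\left(2\theta R + (1 - \theta)\nu\right)h + 1\right)\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2 \qquad (49b)$$
$$\triangleq f_2(\theta, h)\,\|x^{(k)} - x^{(k)}_{\mathrm{prj}}\|^2. \qquad (49c)$$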

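Let $h_1 = \frac{2\theta R + (1-\theta)\nu}{4R^2}$, which is the minimum point of the quadratic function $f_2(\theta, h)$ over variable $h$ for each 
fixed $\theta$. Similarly, we split the interval $[0, 1]$ into $(\frac{r}{1+r}, 1]$ and $[0, \frac{r}{1+r}]$. If $\theta \in (\frac{r}{1+r}, 1]$, we have $\frac{\theta}{R} > h_1$, which 
means $h_1 \notin [\frac{\theta}{R}, +\infty)$. By monotone increasing of $f_2(\theta, h)$ on the interval $h \ge \frac{\theta}{R}$ for each fixed $\theta$, we have 

$$\min_{h \ge \frac{\theta}{R},\ \frac{r}{1+r} < \theta \le 1} f_2(\theta, h) = \min_{\frac{r}{1+r} < \theta \le 1} f_2\left(\theta, \frac{\theta}{R}\right) = \min_{\frac{r}{1+r} < \theta \le 1} 1 - 4\theta(1 - \theta)r = 1 - r,$$

where the minimum value $1 - r$ is obtained at $\theta = \frac{1}{2}$ and $h = \frac{\theta}{R} = \frac{1}{2R}$; note that $\frac{1}{2} \in (\frac{r}{1+r}, 1]$ since $r < 1$. 
If $\theta \in [0, \frac{r}{1+r}]$, we have $\frac{\theta}{R} \le h_1$, which means $h_1 \in [\frac{\theta}{R}, +\infty)$. Thus, 

$$\min_{h \ge \frac{\theta}{R},\ 0 \le \theta \le \frac{r}{1+r}} f_2(\theta, h) = \min_{0 \le \theta \le \frac{r}{1+r}} f_2(\theta, h_1) = 1 - \left(\frac{2r}{1+r}\right)^2,$$

where the minimum value is obtained at $\theta = \frac{r}{1+r}$ and $h = h_1$. After simple calculations, it holds that $r > \left(\frac{2r}{1+r}\right)^2$ 
and hence $1 - r < 1 - \left(\frac{2r}{1+r}\right)^2$. Therefore, on the intervals $h \in [\frac{\theta}{R}, +\infty)$ and $\theta \in [0, 1]$, the minimum 
value $1 - r$ of $f_2(\theta, h)$ is obtained at $(\theta, h) = (\frac{1}{2}, \frac{1}{2R})$ as well. 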
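The outcome of the two cases can be checked numerically. The sketch below is an illustration under stated assumptions: it takes $f_1$ and $f_2$ in the quadratic forms (47b) and (49b), uses the hypothetical name `best_rate`, scans $\theta$ on a uniform grid, and for each $\theta$ minimizes each quadratic over its admissible range of $h$ by clipping the unconstrained minimizer to that range.

```python
def best_rate(nu, R, grid=1001):
    """Minimize the piecewise bound over theta in [0, 1] and h > 0.

    Returns (value, theta, h) at the best grid point.
    """
    def f1(t, h):   # bound used when h <= t / R  (Case A)
        return nu * nu * h * h - 2.0 * ((1.0 - t) * nu + t * nu * nu / (2.0 * R)) * h + 1.0

    def f2(t, h):   # bound used when h >= t / R  (Case B)
        return 4.0 * R * R * h * h - 2.0 * (2.0 * t * R + (1.0 - t) * nu) * h + 1.0

    best = (float("inf"), None, None)
    for i in range(grid):
        t = i / (grid - 1)
        candidates = []
        if t > 0.0:                                   # Case A needs a nonempty (0, t/R]
            h0 = (1.0 - t) / nu + t / (2.0 * R)
            hA = min(h0, t / R)                       # clip the minimizer of f1
            candidates.append((f1(t, hA), t, hA))
        h1 = (2.0 * t * R + (1.0 - t) * nu) / (4.0 * R * R)
        hB = max(h1, t / R)                           # clip the minimizer of f2
        candidates.append((f2(t, hB), t, hB))
        for cand in candidates:
            if cand[0] < best[0]:
                best = cand
    return best

nu, R = 1.0, 4.0                                      # assumed sample constants
value, theta, h = best_rate(nu, R)
print(value, theta, h)
```

For these sample constants $r = \nu/(2R) = 1/8$, and the scan returns the value $1 - r$ at $(\theta, h) = (\frac{1}{2}, \frac{1}{2R})$, in agreement with the conclusion of both cases.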